Introduction

The sharing economy has been a hot topic in recent years. Airbnb, the leader in shared housing, is a great example of how it improves people’s lives and makes booking accommodation much easier than before. In this project, we are interested in the Airbnb host listings in Washington, DC. The project is interesting to us because it can help tourists decide on accommodation when they plan a trip to DC. The results are also useful for hosts who want to improve the performance of their listings. We generate maps to study some characteristics of Airbnb listings in DC, and we investigate the factors that are associated with the popularity of a property listed on Airbnb.

Motivation

Nowadays, more and more people use Airbnb to save money, make friends and, most importantly, experience a city like a local. In the meantime, more rooms are cleaned, decorated and leased for a small fortune or just for fun. With our thorough study of Airbnb data and the resulting visualizations, we hope that people can use Airbnb more efficiently, maximizing their benefit both as travelers and as landlords.

Stakeholders

Our project offers useful information for both hosts and guests in Washington, DC. Hosts will find it helpful because they can improve their rooms and align them with customers’ demand to raise profit or increase popularity. Meanwhile, guests will get a broader view of the housing market in Washington. Our study reveals some popular regions that people like to stay in, which helps guests better prepare for their trip to DC.

Choice of Dataset

The dataset we use is from Inside Airbnb (http://insideairbnb.com/get-the-data.html). The website provides data that is publicly available from Airbnb. We chose this dataset because it contains well-organized, comprehensive data that is ready to be used directly in data analysis. We focus on all the rooms offered in Washington, DC because it is a popular tourist city with large potential for Airbnb development and expansion. The fields we focus on are:

  1. Descriptions and Titles: used to extract keywords that form an overview of a neighborhood.
  2. Latitude and Longitude: used to locate all available rooms and study their common traits.
  3. Neighbourhood: the neighbourhood of a room indicates whether travelers can easily get to the places they want to visit.
  4. Room type: whether room type is an important factor people consider when choosing an Airbnb listing.
  5. Price: price is a crucial measure of demand and supply in a free market.
  6. Amenities available in a room: these could affect people’s decision when choosing a room.
  7. Availability: a quantitative measurement of the popularity of a room.
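
The availability field can be turned into a popularity measure with simple arithmetic. The sketch below is a minimal illustration, assuming that `availability_365` counts the days (out of 365) on which a listing can still be booked; the helper name `occupied_percent` is ours and is used here only for illustration.

```python
def occupied_percent(availability_365, days=365):
    """Share of the window (in percent) during which a listing is not available."""
    if not 0 <= availability_365 <= days:
        raise ValueError("availability_365 must lie between 0 and days")
    return (days - availability_365) / days * 100


# A listing that is free on every day counts as 0% occupied,
# one that is free on 73 of 365 days counts as 80% occupied.
print(occupied_percent(365))
print(occupied_percent(73))
```

Under this definition, days that a host blocks for other reasons also count as occupied, so the measure is a proxy for popularity rather than an exact booking rate.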

Read Data

In [4]:
# Import all the necessary packages
import csv
import pandas as pd
import numpy as np
import nltk
import string
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import random
import folium
from folium.plugins import MarkerCluster, ScrollZoomToggler, HeatMap
import json
import os
import statsmodels.stats.api as sm
from statsmodels.formula.api import ols
import statsmodels.graphics
import statsmodels.formula.api as smf
from jinja2 import Template
from branca.element import MacroElement
import branca.colormap as cm
import folium.plugins
import matplotlib.axes as ax
%matplotlib inline
In [6]:
# Read csv files
df = pd.read_csv('wdc_listings.csv', index_col=0)
df.head()

# Modify the data set: occupied percentage = share of the 365 days that are not available
df['occupied_percent'] = (365 - df['availability_365']) / 365 * 100
In [7]:
def isType(entry, type_want):
    """Turn a value into a dummy variable.
    Arguments: entry is the original value; type_want is the value that counts as true.
    Returns the integer 1 if entry equals type_want, otherwise 0."""
    if (entry == type_want):
        return 1 
    else:
        return 0

def changeAmenity(entry):
    entry = str(entry)
    entry = entry.strip('{')
    entry = entry.strip('}')
    entry= entry.strip('')
    ame_list = entry.split(',')
    ame_list_new = [[i.strip('"') for i in ame_list]]
    return ame_list_new
    
df['host_is_superhost'] = [isType(i, 't') for i in df['host_is_superhost']]
df['host_has_profile_pic'] = [isType(i, 't') for i in df['host_has_profile_pic']]
df['host_identity_verified'] = [isType(i, 't') for i in df['host_identity_verified']]
df['entire_home'] = [isType(i, 'Entire home/apt') for i in df['room_type']]
df['private_room'] =  [isType(i, 'Private room') for i in df['room_type']]
df['require_guest_phone_verification'] = [isType(i, 't') for i in df['require_guest_phone_verification']]
df['cancel_strict'] = [isType(i, 'strict') for i in df['cancellation_policy']]
df['instant_bookable'] = [isType(i, 't') for i in df['instant_bookable']]
df['host_response_rate'] = [ float(str(i).strip('%')) for i in df['host_response_rate']]
df['host_acceptance_rate'] = [ float(str(i).strip('%')) for i in df['host_acceptance_rate']]
df['cleaning_fee'] = np.nan_to_num(df['cleaning_fee'])
df['security_deposit'] = np.nan_to_num(df['security_deposit'])
df['amenities'] = [changeAmenity(i) for i in df['amenities']]
In [34]:
pop = pd.DataFrame(df.groupby('neighbourhood_cleansed')['occupied_percent'].mean())
pop.reset_index(inplace=True)
#pop.head()

pop_price = pd.DataFrame(df.groupby('neighbourhood_cleansed')['price'].mean())
pop_price = pop_price.sort_values(by='price', ascending=False)
pop_price.reset_index(inplace=True)

pop_count = pd.DataFrame(df.groupby('neighbourhood_cleansed')['name'].count())
pop_count.reset_index(inplace=True)
#pop_count

Map of Room Availability for each Neighborhood

In [5]:
list_loc = [[row['latitude'], row['longitude'], row['occupied_percent'], row['price'], row['neighbourhood_cleansed']] for index, row in df.iterrows()]
In [33]:
### numbers of listings
DC_COORDINATES = (38.9072, -77.0369)
dc = os.getcwd() + '/wdc.geojson'
m = folium.Map(DC_COORDINATES, zoom_start = 12)

marker_cluster = folium.MarkerCluster("Cluster Name").add_to(m)

# Add each marker to the cluster only; adding it to the map as well would draw it twice
for location in list_loc:
    folium.Marker(location = location[:2], icon=folium.Icon(color='red'),
                 popup = str(location[3])).add_to(marker_cluster)

m.choropleth(geo_str = open(dc).read(), 
    data=pop_count,
    columns=['neighbourhood_cleansed', 'name'],
    key_on='properties.neighbourhood', 
    fill_color='GnBu',
    threshold_scale=[0, 30, 60, 100, 250, 500],
    fill_opacity = 0.85,
    line_weight = 1.2,
    legend_name = "Number of listings in a District")

toggler = ScrollZoomToggler().add_to(m)

m
Out[33]:

We want to display how Airbnb listings and their prices are distributed within the DC region. To make this easy to see, we made a map of DC and used different colors to represent the density of listings. Dark blue represents high density while light green represents low density. Each marker represents one Airbnb listing, and the number in its popup shows the daily price for that listing. As we can see from the map, Airbnb housing is concentrated in the central DC area, especially the region north of the White House and the Lincoln Memorial. According to the map, there are 1312 listings within that region. Looking into the prices in each neighborhood, we found that in the central area prices range from 200 to 400 dollars, while in the outer areas daily prices are relatively lower, at less than 100 dollars. In the following bar chart, we show the average price for each neighborhood more clearly.
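
The colouring step described above can be sketched without a map. This is a simplified stand-in, not folium's internal logic: it counts listings per neighbourhood and assigns each count to one of the intervals of the threshold scale used for the legend. The neighbourhood names and the helper `colour_bin` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

THRESHOLDS = [0, 30, 60, 100, 250, 500]  # legend scale of the listings map


def colour_bin(count, thresholds=THRESHOLDS):
    """Index of the legend interval a count falls into (0 = lightest colour)."""
    last_bin = len(thresholds) - 2  # five intervals, indices 0..4
    return min(bisect_right(thresholds, count) - 1, last_bin)


# Toy data: one neighbourhood name per listing (hypothetical names)
listing_neighbourhoods = ["Centre"] * 120 + ["Outer East"] * 12 + ["Riverside"] * 45
counts = Counter(listing_neighbourhoods)

for name, n in counts.most_common():
    print(name, n, colour_bin(n))
```

Counts above the last threshold are clamped into the darkest interval in this sketch, which is one possible way to treat very dense neighbourhoods.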

In [35]:
sns.color_palette("Blues")
sns.barplot(y='neighbourhood_cleansed', x='price', data =pop_price.head(10))

plt.title('Neighbourhood With Highest Airbnb Listing Price')
plt.xlabel('Price ($)')
plt.ylabel('Neighbourhood')
plt.show()

As the bar chart above shows, the "Georgetown, Burleith/Hillandale" region has the highest average daily price for Airbnb listings, while the "Shaw, Logan Circle" region has the lowest average daily price among the ten neighbourhoods shown. This result fits the conclusion we drew from the map above.
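
The ranking behind such a bar chart can be reproduced on a toy table. The sketch below assumes a data frame with the same two column names as the notebook; the neighbourhood labels and prices are made up.

```python
import pandas as pd

# Toy listings (hypothetical neighbourhoods and prices)
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "B", "B", "C"],
    "price": [200, 300, 90, 110, 150],
})

# Average price per neighbourhood, most expensive first
avg_price = (toy.groupby("neighbourhood_cleansed")["price"]
                .mean()
                .sort_values(ascending=False)
                .reset_index())

top = avg_price.head(2)  # only the most expensive neighbourhoods are plotted
print(top)
```

Because `head` keeps only the top of the sorted table, the last bar of such a chart is the cheapest of the selected neighbourhoods, not necessarily the cheapest overall.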

In [36]:
### occupied rate(measuring popularity)
DC_COORDINATES = (38.9072, -77.0369)
dc = os.getcwd() + '/wdc.geojson'

m = folium.Map(DC_COORDINATES, zoom_start = 12)
marker_cluster = folium.MarkerCluster("Cluster Name").add_to(m)
# Add each marker to the cluster only; adding it to the map as well would draw it twice
for location in list_loc:
    folium.Marker(location = location[:2], popup = str(location[2])).add_to(marker_cluster)

m.choropleth(geo_str = open(dc).read(), 
    data=pop,
    columns=['neighbourhood_cleansed', 'occupied_percent'],
    key_on='properties.neighbourhood', 
    fill_color='BuPu',
    threshold_scale=[0, 10, 20, 30, 40, 100],
    fill_opacity = 0.85,
    line_weight = 1.2,
    legend_name = "Occupied Rate(%)")

toggler = ScrollZoomToggler().add_to(m)
    
m
Out[36]:

We want to see the distribution of the occupancy rate for DC Airbnb listings. Similar to the previous map, a darker color indicates a higher value, here the average occupancy rate of the listings within the region. Each marker represents a single Airbnb listing, and the number in its popup is the percentage of days the listing is occupied within 365 days. We guessed that central regions might have a higher occupancy rate than outer regions, because we assume that most guests come for short-term visits and are therefore more likely to stay in the central areas, which are more convenient and closer to the places of interest. Looking at the map, we found that the occupancy rate varies across neighborhoods, and it is not guaranteed that central areas have a higher occupancy rate than the outer regions. Therefore we look into the data further to see which elements influence the ‘popularity’ of a listing.

In [8]:
### price
m = folium.Map(DC_COORDINATES, zoom_start = 12)

marker_cluster = folium.MarkerCluster("Cluster Name").add_to(m)
# Add each marker to the cluster only; adding it to the map as well would draw it twice
for location in list_loc:
    folium.Marker(location = location[:2], icon=folium.Icon(color='blue'),
                        popup = str(location[3])).add_to(marker_cluster)
m.choropleth(geo_str = open(dc).read(), 
    data=pop_price,
    columns=['neighbourhood_cleansed', 'price'],
    key_on='properties.neighbourhood', 
    fill_color='OrRd',
    threshold_scale=[0, 50, 75, 100, 150, 300],
    fill_opacity = 0.85,
    line_weight = 1.2,
    legend_name = "Price")
toggler = ScrollZoomToggler().add_to(m)    

m
Out[8]:

To see the distribution of price more directly, we made the third map. The color indicates the price level within each area. As the legend shows, a darker color indicates a higher average price in the region. From an earlier map, we concluded that the central part of DC might have higher prices and the outer regions lower prices. This map shows that this is not entirely true. The central region and the regions around the Potomac River have higher Airbnb prices, while the southeastern and northern regions of DC have lower Airbnb prices. Later we use a multiple linear regression model to find out which elements are indicators of a listing’s popularity.

In [9]:
#sns.distplot(df['occupied_percent'], bins=None, hist=True)
plt.hist(df['occupied_percent'], 50, density=True, facecolor='purple', alpha=0.75)
plt.title('Distribution of Occupied Percentage of Listings')
plt.xlabel('Occupied Rate')
plt.ylabel('Proportion')
plt.show()

In this graph, we show the distribution of the occupied percentage of the listings. The chart has 50 bars, and each bar covers 2 percentage points of the occupancy rate. As we can see from the chart, most listings have a low occupancy rate. Around 7% of Airbnb listings have an occupancy rate below 2% within 365 days. The graph also indicates that about 3% of the listings have an occupancy rate of almost 100%, which means that only a few listings are extremely popular in the DC region. Now we want to compare the most popular and the least popular listings.
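
To make the reading of such a histogram concrete, the following sketch uses synthetic occupancy values (not the Airbnb data) and shows how 50 bars over the range 0 to 100 give a bar width of 2 percentage points, and how the normalised bar height relates to the share of listings in each bar.

```python
import numpy as np

rng = np.random.default_rng(0)
occupied = rng.uniform(0, 100, size=1000)  # synthetic stand-in for occupied_percent

counts, edges = np.histogram(occupied, bins=50, range=(0, 100))
width = edges[1] - edges[0]          # width of one bar in percentage points
proportion = counts / counts.sum()   # share of listings that fall into each bar

# With a normalised histogram the bar height is a density:
# height * width gives back the share of listings in the bar.
density, _ = np.histogram(occupied, bins=50, range=(0, 100), density=True)

print(width)
print(np.allclose(density * width, proportion))
```

With a bar width of 2, a normalised bar height of 0.035 would therefore correspond to about 7% of the listings.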

We decided to look into the listings with an occupancy rate of at least 75% and compare them with those with an occupancy rate of at most 10%.

In [10]:
above75 = df.query('occupied_percent >= 75')
below10 = df.query('occupied_percent <= 10')
In [37]:
plt.boxplot([above75['price'], below10['price']], 0 ,'')
plt.title('Distribution of Price of Listings')
plt.xlabel('Group')
plt.ylabel('Price')
plt.xticks(range(1,3),['Above 75%', 'Below 10%'])
plt.show()

The box plot above shows the price distribution for listings with an occupancy rate above 75% and for listings with an occupancy rate below 10%. Both groups have roughly the same typical (median) price, while the listings above 75% show a smaller variation in price than the listings below 10%. It is surprising to see that the listings below 10% reach a higher maximum price. Still, we cannot conclude that a lower price indicates higher popularity.
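
The quantities a box plot displays can also be computed directly. The sketch below uses two small made-up price lists and the standard-library `statistics` module (Python 3.8 or later); the helper `summarise` is hypothetical and only illustrates how a similar median can coexist with a very different spread.

```python
import statistics


def summarise(prices):
    """Median, interquartile range and maximum of a list of prices."""
    q1, median, q3 = statistics.quantiles(prices, n=4)
    return {"median": median, "iqr": q3 - q1, "max": max(prices)}


# Made-up prices for two groups of listings
popular = [80, 90, 100, 110, 120]
unpopular = [40, 70, 100, 130, 400]

print(summarise(popular))
print(summarise(unpopular))
```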

In [12]:
plt.hist(above75['accommodates'], 15, density=True, facecolor='blue', alpha=0.25, label='Above 75%')
plt.hist(below10['accommodates'], 15, density=True, facecolor='green', alpha=0.25, label='Below 10%')
plt.legend(loc='upper right')
plt.title('Distribution of Accommodates of Listings')
plt.xlabel('Accommodates')
plt.ylabel('Proportion')
plt.show()

In this graph, we look at the relationship between the maximum number of guests a listing accommodates and its popularity. Blue represents listings above 75% occupancy, and green represents listings below 10%. Both histograms have similar shapes and are skewed to the right, which indicates that few listings accommodate a large number of guests (more than 10 people). According to the graph, listings that accommodate 2 guests make up a higher proportion of the above-75% group than of the below-10% group. So we conclude that popular listings mostly accommodate 2 to 3 guests.
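
A table of proportions makes the same comparison without relying on overlapping bars. This is a small sketch on made-up `accommodates` values; the column labels `above_75` and `below_10` are our own names for the two groups.

```python
import pandas as pd

# Made-up capacity values for the two groups
popular = pd.Series([2, 2, 2, 3, 4, 2])
unpopular = pd.Series([2, 4, 4, 6, 1, 2])

# Share of listings per capacity within each group, aligned on capacity
share = pd.DataFrame({
    "above_75": popular.value_counts(normalize=True),
    "below_10": unpopular.value_counts(normalize=True),
}).fillna(0).sort_index()

print(share)
```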

In [13]:
above_list = sum(list(above75['amenities']), []) 
result = sum(above_list, [])
result = pd.DataFrame(result)
result.reset_index(inplace=True)
ame_count = result.groupby(0).count()
ame_count.reset_index(inplace=True)
ame_count = ame_count.sort_values(by='index', ascending=False)
ame_count['percent'] = ame_count['index'] / above75.shape[0] * 100
In [39]:
sns.barplot(y=0, x='percent', data =ame_count.head(15))
plt.title('Top Amenities in Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Amenities')
plt.show()
In [15]:
below_list = sum(list(below10['amenities']), []) 
result = sum(below_list, [])
result = pd.DataFrame(result)
result.reset_index(inplace=True)
ame_count_b = result.groupby(0).count()
ame_count_b.reset_index(inplace=True)
ame_count_b = ame_count_b.sort_values(by='index', ascending=False)
ame_count_b['percent'] = ame_count_b['index'] / below10.shape[0] * 100
In [16]:
sns.barplot(y=0, x='percent', data =ame_count_b.head(15))
plt.title('Top Amenities in Less Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Amenities')
plt.show()

Now we want to see the relationship between amenities and popularity. The two bar charts above show the percentage of listings offering each amenity, one for popular listings (over 75% occupancy rate) and the other for unpopular listings (less than 10% occupancy rate). The two graphs have similar shapes, although the popular listings show a higher percentage of air conditioning and a lower percentage of heating. Overall, the two graphs do not offer much useful information about what indicates popularity.
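
The amenity percentages can be sketched with the standard library alone. The example assumes that each raw amenities value is a brace-wrapped, comma-separated string such as `{TV,"Air Conditioning",Kitchen}`; the helper names and the sample rows are hypothetical.

```python
from collections import Counter


def parse_amenities(raw):
    """Split a string like '{TV,"Air Conditioning"}' into a list of amenity names."""
    inner = str(raw).strip("{}")
    return [item.strip('"') for item in inner.split(",") if item.strip('"')]


def amenity_percent(rows):
    """Percentage of listings that offer each amenity, most common first."""
    counts = Counter(a for raw in rows for a in parse_amenities(raw))
    return {name: n / len(rows) * 100 for name, n in counts.most_common()}


rows = ['{TV,"Air Conditioning",Kitchen}', '{TV,Heating}', '{}']
print(parse_amenities(rows[0]))
print(amenity_percent(rows))
```

A plain split on commas would break an amenity name that itself contains a comma; in that case a CSV-aware parser would be the safer choice.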

In [17]:
neighbour1 = above75.groupby('neighbourhood').count()
neighbour1 = neighbour1.sort_values(by='name', ascending=False)
neighbour1['name'] = neighbour1['name'] / above75.shape[0] * 100
neighbour1.reset_index(inplace=True)
#neighbour1.head(10)
In [18]:
neighbour2 = below10.groupby('neighbourhood').count()
neighbour2 = neighbour2.sort_values(by='name', ascending=False)
neighbour2['name'] = neighbour2['name'] / below10.shape[0] * 100
neighbour2.reset_index(inplace=True)
#neighbour2.head(10)
In [40]:
sns.barplot(x='name', y='neighbourhood', data =neighbour1.head(10))
plt.title('Top Regions With Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Neighbourhood')
plt.show()
In [41]:
sns.barplot(x='name', y='neighbourhood', data =neighbour2.head(10))
plt.title('Top Regions With Less Popular Listings')
plt.xlabel('Percent')
plt.ylabel('Neighbourhood')
plt.show()

The two graphs above show the relationship between the region and the popularity of a listing. The first graph, which shows the top regions for popular listings, has the most popular listings in the “Columbia Heights” region and few in the Kalorama region. The second graph shows that there are many unpopular listings in the “Capitol Hill” region, and that “Columbia Heights” has the second-largest share of unpopular listings. From this we know that one region can have many popular and many unpopular listings at the same time. Therefore we need to look into other elements to see whether a single indicator decides the popularity of a listing.
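
One reason a region can lead both rankings is simply that it has many listings. The sketch below, on made-up data with hypothetical neighbourhood names, contrasts a neighbourhood's share of each group with the rate of popular listings inside the neighbourhood.

```python
from collections import Counter

# (neighbourhood, occupied percentage) pairs; toy data
listings = [("Big", 99), ("Big", 90), ("Big", 80), ("Big", 5), ("Big", 3),
            ("Big", 50), ("Big", 40), ("Small", 95), ("Small", 85)]

total = Counter(n for n, _ in listings)
popular = Counter(n for n, occ in listings if occ >= 75)
unpopular = Counter(n for n, occ in listings if occ <= 10)

# Share of each group that lies in a neighbourhood (what the bar charts plot)
share_popular = {n: popular[n] / sum(popular.values()) for n in total}
share_unpopular = {n: unpopular[n] / sum(unpopular.values()) for n in total}

# Rate of popular listings within a neighbourhood (adjusts for its size)
rate_popular = {n: popular[n] / total[n] for n in total}

print(share_popular, share_unpopular, rate_popular)
```

In this toy example "Big" leads both groups by share, while "Small" has the higher rate of popular listings, so the two views can disagree.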

In [42]:
# above75 description Wordcloud
description = above75.dropna(subset = ['description'])
words = str.join(' ', description.description)

stop = set(stopwords.words('english'))
stop.add('room')

# Close the file after writing so that the text is on disk before it is read back
with open('description.txt', 'w') as f:
    f.write(words)
logo = np.array(Image.open("Airbnb-logo.png"))
image_colors = ImageColorGenerator(logo)


text = open('description.txt').read()
wc = WordCloud(stopwords=stop).generate(text)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis("off")
plt.show()
In [43]:
# below10 description Wordcloud
description = below10.dropna(subset = ['description'])
words = str.join(' ', description.description)

stop = set(stopwords.words('english'))
stop.add('room')

# Close the file after writing so that the text is on disk before it is read back
with open('description.txt', 'w') as f:
    f.write(words)
logo = np.array(Image.open("Airbnb-logo.png"))
image_colors = ImageColorGenerator(logo)


text = open('description.txt').read()
wc = WordCloud(stopwords=stop).generate(text)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis("off")
plt.show()

Now we want to see whether the words in the description influence the popularity of a listing. As shown above, words such as “apartment”, “bedroom” and “metro” are frequently used in both popular and unpopular listings. In the graphs, the bigger a word is, the more frequently it appears in the ‘description’ section. Comparing the two word clouds, the unpopular listings tend to mention the words ‘metro’ and ‘kitchen’ more. Overall, popular and unpopular listings use similar words in the ‘description’ section. Later we make another word cloud to show which words are used most in the ‘description’ section overall.
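
The word sizes in a word cloud reflect word counts, which can be inspected directly. The following sketch counts words in a few made-up descriptions; the tiny stop-word set is only a stand-in for the NLTK English list, and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

# Tiny stand-in for the NLTK stop-word list, plus the extra word 'room'
STOP = {"the", "a", "to", "and", "in", "is", "of", "with", "room"}


def top_words(descriptions, n=3):
    """Most frequent non-stop-words across a collection of descriptions."""
    tokens = re.findall(r"[a-z']+", " ".join(descriptions).lower())
    return Counter(t for t in tokens if t not in STOP).most_common(n)


sample = ["Cozy apartment near the metro",
          "Apartment with kitchen, close to metro",
          "Sunny apartment"]
print(top_words(sample, 2))
```

Running the same function on the descriptions of each group and comparing the two lists gives a numeric counterpart to comparing the clouds by eye.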

Top 10 Neighborhoods with Highest Average Room Price

In [23]:
# Get the average price
neighbourhood =df.dropna(subset = ['neighbourhood'])
neighbourhood_price = neighbourhood.groupby('neighbourhood').mean(numeric_only=True)
neighbourhood_price['neighbourhood'] = neighbourhood_price.index
neighbourhood_price = neighbourhood_price.sort_values('price', ascending = False)
In [24]:
# Draw the bar chart
sns.barplot(y='neighbourhood', x='price', data = neighbourhood_price[:10])
plt.title('Top 10 Neighborhoods with Highest Average Room Price')
plt.xlabel('Price')
plt.ylabel('Neighbourhood')
plt.show()

As presented in this bar chart, Pleasant Hill is the most expensive place listed on Airbnb in Washington, DC. The average price is almost $600, which could be because more single houses are posted in that area. The second-highest average price is in Hillcrest, at almost $400 per night. After looking into the places with high prices, we realized that these neighbourhoods are not close to downtown. They are either on the Northeast side of DC or close to hills, which could be an indicator of local people's residential patterns in DC. Generally, the places listed at a higher price are not close to downtown, so their main target customers are likely not tourists.

In [25]:
# Stacked Bar
Top10Neigh = neighbourhood_price[:10]['neighbourhood']
neighbourhood_no = neighbourhood.groupby(['neighbourhood', 'room_type']).count()
neighbourhood_no= neighbourhood_no[['listing_url']]
neighbourhood_no = neighbourhood_no.unstack(level=1)
neighbourhood_no = neighbourhood_no.loc[[w[0] in Top10Neigh for w in neighbourhood_no.iterrows()], :]
neighbourhood_no = neighbourhood_no.T
for n in Top10Neigh:
    neighbourhood_no[[n]] = neighbourhood_no[[n]]/neighbourhood_no[[n]].sum()
neighbourhood_no = neighbourhood_no.T
In [26]:
# Columns are ordered Entire home/apt, Private room, Shared room; missing combinations count as 0
y_pos = np.arange(neighbourhood_no.shape[0])
entire = neighbourhood_no.iloc[:, 0].fillna(0).values
private = neighbourhood_no.iloc[:, 1].fillna(0).values
shared = neighbourhood_no.iloc[:, 2].fillna(0).values
p1 = plt.barh(y_pos, entire, color='#F9AA8D')
p2 = plt.barh(y_pos, private, left=entire, color='#F7C78E')
p3 = plt.barh(y_pos, shared, left=entire + private, color='#F7DD8C')

plt.xlabel('Percentage')
plt.ylabel('Neighbourhoods')
plt.yticks(np.arange(neighbourhood_no.shape[0]),  neighbourhood_no.index.values)
plt.title('Three Room Types Distribution of the Top 10 Highest Price Neighbourhoods', y=1.08)
plt.legend((p1[0], p2[0],p3[0]), ('Entire', 'Private', 'Shared'),bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()

This stacked plot shows the pattern of room types in these most expensive places. Most of the properties are listed to be booked as an entire home. A small portion of the properties are listed as private rooms that share common space with the host. Nearly no rooms are listed as shared rooms. This supports our speculation above that the main target of these properties is high-end customers rather than guests who come to DC as tourists.
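
The room-type shares behind a stacked chart like this can be computed in one step. The sketch below is an alternative formulation on made-up data, using `pd.crosstab` with row normalisation rather than the manual steps of the notebook; the neighbourhood labels are hypothetical.

```python
import pandas as pd

# Made-up listings: neighbourhood and room type
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Entire home/apt", "Shared room"],
})

# Each row sums to 1; combinations that never occur are 0 rather than missing
mix = pd.crosstab(toy["neighbourhood"], toy["room_type"], normalize="index")
print(mix)
```

A table in this shape can be passed to a stacked horizontal bar plot, with one bar per neighbourhood and one segment per room type.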

Word Cloud of High Frequency Words in Names and Descriptions

In [27]:
description = df.dropna(subset = ['description'])
words = str.join(' ', description.description)
words

stop = set(stopwords.words('english'))
stop.add('room')

# Close the file after writing so that the text is on disk before it is read back
with open('description.txt', 'w') as f:
    f.write(words)
text = open('description.txt').read()

logo = np.array(Image.open("Airbnb-logo.png"))

wc = WordCloud(background_color="white", max_words=500, mask=logo, stopwords=stop)
# generate word cloud
wc.generate(text)
image_colors = ImageColorGenerator(logo)
wc.recolor(color_func=image_colors)

# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

Lastly, we make a word cloud that shows which words are frequently used in the ‘description’ section overall. The word cloud above shows that words such as “apartment”, “bedroom”, “metro” and “kitchen” are frequently used in the “description” section.

In [8]:
rcode = """occupied_percent ~ price + host_response_rate + 
        entire_home + private_room + guests_included + bathrooms + 
        bedrooms + cancel_strict + 
        number_of_reviews + review_scores_location + 
        review_scores_value"""

lm = smf.ols(formula= rcode, data=df).fit()
print(lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:       occupied_percent   R-squared:                       0.077
Model:                            OLS   Adj. R-squared:                  0.074
Method:                 Least Squares   F-statistic:                     20.10
Date:                Wed, 22 Mar 2017   Prob (F-statistic):           1.73e-39
Time:                        23:49:45   Log-Likelihood:                -12913.
No. Observations:                2644   AIC:                         2.585e+04
Df Residuals:                    2632   BIC:                         2.592e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                -46.5317     10.169     -4.576      0.000       -66.471   -26.592
price                     -0.0893      0.011     -8.103      0.000        -0.111    -0.068
host_response_rate         0.2364      0.045      5.213      0.000         0.147     0.325
entire_home               22.1579      3.992      5.550      0.000        14.330    29.986
private_room               9.1541      3.954      2.315      0.021         1.402    16.906
guests_included           -2.0569      0.569     -3.615      0.000        -3.173    -0.941
bathrooms                  3.4556      1.582      2.184      0.029         0.353     6.558
bedrooms                   4.5006      1.218      3.696      0.000         2.113     6.888
cancel_strict             -5.8064      1.347     -4.311      0.000        -8.448    -3.165
number_of_reviews         -0.0604      0.019     -3.125      0.002        -0.098    -0.023
review_scores_location     3.0420      0.793      3.836      0.000         1.487     4.597
review_scores_value        2.2440      0.839      2.675      0.008         0.599     3.889
==============================================================================
Omnibus:                      416.693   Durbin-Watson:                   1.951
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              321.654
Skew:                           0.757   Prob(JB):                     1.42e-70
Kurtosis:                       2.207   Cond. No.                     3.03e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.03e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Model Result

To generate quantitative results and further estimate the factors that are associated with the popularity of an Airbnb listing, we use a multiple linear regression model to measure each factor's contribution to the popularity of a room listed on Airbnb. Eventually, we came up with this model.

The factors that we thought were influential included price, room type, host response rate, superhost status, whether the host has a profile picture, number of people allowed, bathrooms, bedrooms and beds, the amount of the security deposit and cleaning fee, cancellation policy, number of reviews, and review scores on different aspects. In the end, we selected the model stated above, keeping the parameters that have a significant impact on the popularity of a room. We chose the percentage of days occupied within a year as the measurement of the popularity of a room. It is continuous, and thus we decided to use multiple linear regression to estimate the impact.

Based on the model result, the variables that are considered significant are: price, host response rate, whether the property is listed as an entire home, whether the room is private, number of guests included, number of bathrooms, number of bedrooms, cancellation policy, number of reviews, and review scores for location and value. Among these variables, the positive factors include:

  1. host response rate: a 1% increase in host response rate is associated with a 0.2364-point increase in occupied percentage within a year.
  2. entire home: a property listed as an entire home has an occupied percentage 22.1579 points higher than a shared room.
  3. private room: a private room has an occupied percentage 9.1541 points higher than a shared room.
  4. number of bathrooms: one more bathroom is associated with a 3.4556-point increase in occupied percentage.
  5. number of bedrooms: one more bedroom is associated with a 4.5006-point increase in occupied percentage.
  6. review scores of location: a review score of location that is one point higher is associated with a 3.0420-point increase in occupied percentage.
  7. review scores of value: a review score of value that is one point higher is associated with a 2.2440-point increase in occupied percentage.

Meanwhile, the negative factors in this model include:

  1. price: a room that is $1 more expensive is associated with a 0.0893-point decrease in occupied percentage.
  2. guests included: each additional guest included is associated with a 2.0569-point decrease in occupied percentage.
  3. strict cancellation policy: a room with a strict cancellation policy is associated with a 5.8064-point decrease in occupied percentage.
  4. number of reviews: ten more reviews are associated with a 0.604-point decrease in occupied percentage.
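
The arithmetic behind these readings can be checked with the coefficients printed in the regression summary. The sketch below copies those coefficients into a dictionary; the helper `occupancy_change` is hypothetical and simply multiplies each change in a variable by its coefficient, holding everything else fixed.

```python
# Coefficients copied from the OLS summary (intercept omitted)
COEF = {
    "price": -0.0893,
    "host_response_rate": 0.2364,
    "entire_home": 22.1579,
    "private_room": 9.1541,
    "guests_included": -2.0569,
    "bathrooms": 3.4556,
    "bedrooms": 4.5006,
    "cancel_strict": -5.8064,
    "number_of_reviews": -0.0604,
    "review_scores_location": 3.0420,
    "review_scores_value": 2.2440,
}


def occupancy_change(**deltas):
    """Predicted change in occupied percentage (in points) for the given changes."""
    return sum(COEF[name] * delta for name, delta in deltas.items())


print(round(occupancy_change(price=10), 4))              # a $10 higher price
print(round(occupancy_change(number_of_reviews=10), 4))  # ten more reviews
print(round(occupancy_change(entire_home=1, price=50), 4))
```

These are associations within a linear model, so the numbers describe differences in expected occupied percentage rather than probabilities of a single booking.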

Model Interpretations

From the result above, we gain some insights into guests’ decision-making process and the points they pay attention to. Price is obviously an important factor, as people like to save money. At the same time, guests also care about privacy: rooms with more privacy are more popular than shared rooms. The result suggests that people traveling to DC tend to be in small groups of two to three people. Therefore, a larger room is not as desirable as one that holds just the right number for the group. Guests also consider bathrooms a crucial feature of a room listed on Airbnb, and they are more likely to book a room with more bathrooms and bedrooms. Most importantly, the connections formed on Airbnb are also a factor influencing people's decisions: higher review scores for value and location attract guests.

Discussion of the Model

The coefficients of the factors most likely to be influential make sense in our result. Yet the R-squared is small, with only 7.7% explanatory power. This suggests that we have missed some significant factors that are not in our data set. For example, location is an important factor in the decision-making process when a tourist chooses a place to stay. It is excluded from the linear regression model because a categorical variable with over 30 levels is not meaningful in a linear model. Including neighbourhood would require a more advanced model to quantify the effect. Another factor that is not captured by the model is the pictures posted on the Airbnb website. The picture of a room is the first impression for a guest and is likely to affect people's decisions, since people generally pay more attention to pictures than to words. With only a URL link to the picture, we are unable to quantify the quality of an image of a property.
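
To illustrate what a small R-squared means, the sketch below fits a one-variable least-squares line to synthetic data with a weak signal and a lot of noise, then computes R-squared and adjusted R-squared by hand. The data, the seed and the variable names are assumptions for illustration only and are unrelated to the Airbnb listings.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 500, 1                          # observations and number of predictors
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)       # weak signal buried in noise

X = np.column_stack([np.ones(n), x])   # intercept plus predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

ss_res = np.sum(resid ** 2)            # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(r2, 3), round(adj_r2, 3))
```

A real effect can still be present when R-squared is low; the statistic only says that most of the variation in the outcome is not accounted for by the predictors in the model.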

Project Conclusion

In this project, our main goal is to find out the characteristics of popular rooms listed on Airbnb. We first looked into the distribution of rooms in Washington, DC. The rooms that are more popular and more expensive are mainly near the center of Washington, DC. Meanwhile, we inspected two groups of rooms, one more popular and the other rarely booked. The result shows that the description information presented online is not the main difference between these two groups. Also, the model results show that several factors are positively or negatively associated with the popularity of a room listed on Airbnb. More research is needed, as we believe that some important factors are missing from our model.
